This module will cover:
which will require the following skills already covered:
We will also touch on choosing an appropriate visualization, interactive graphics, and maps.
Data visualization in R can be quick and dirty (i.e., data exploration for yourself) or highly polished (i.e., communicating to others). We already touched on quick data exploration in the third module yesterday. Today we will cover how to produce a more polished-looking plot.
R vs ggplot
Plotting in base R allows the user to create highly customized plots. This customization takes time and requires many decisions. An alternative is the package ggplot2, developed by Hadley Wickham and based on The Grammar of Graphics by Leland Wilkinson. ggplot2 has its own syntax that differs a bit from base R. I will walk through an example using base R and then recreate the figure using ggplot2. For even more side-by-side examples, see Nathan Yau’s blog post on FlowingData.
R
A simple plot can take many more lines of code than you would expect from the visualization. When plotting in base R, you’ll use a handful of parameter settings, either in par() or in the plotting-related functions listed below.
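Settings changed through par() stay in effect for later plots on the same device, so a common pattern is to save the previous values and restore them when you are done. Here is a minimal sketch of that pattern; the margin values and the toy data are illustrative assumptions, not part of the example that follows.

```r
# par() returns the previous values of the settings it changes
old_par <- par(mar = c(4, 4, 2, 1), las = 1)

# Toy data (hypothetical), just to have something to draw
plot(1:10, (1:10)^2, pch = 19, xlab = "x", ylab = "x squared")

# Restore the earlier settings so later plots are unaffected
par(old_par)
restored_las <- par("las")
```

Saving the return value of par() means you do not have to remember what the defaults were.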
Let’s create a plot of total population against county area for five Midwest states. These data are part of the ggplot2 package. I start with the basic scatterplot function plot() and customize from there.
library(ggplot2) # load the package that contains the data
data("midwest", package = "ggplot2") # load the data; midwest is now in the working environment
plot(y = log10(midwest$poptotal), x = midwest$area, # set the x and y values
     col = as.factor(midwest$state), # color the points by state
     pch = 19, cex = 0.75, # point shape and size
     ylim = c(3, 7), xlim = c(0, 0.1), # set the axis limits
     las = 1, # draw the axis tick labels horizontally
     xlab = "Area", ylab = expression('Log'[10]*'(Total population)'), # label the axes
     main = "Area vs population" # add a title
)
This is where the true power of base R customization shows. You can change the axis ticks and labels, add text anywhere, and even create multiple figures in a single visualization. The most common addition to any visualization is a legend, since legends are not created automatically when plotting with base R; you have to add them manually. There are a few ways to do this, but the function legend() works in most cases. To add a legend to the plot above, run legend() after plot().
legend("topright", col=c(1:5), pch=19,legend=levels(as.factor(midwest$state)))
The visualization would then look like this:
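As an aside on why col = 1:5 in the legend matches the point colors: a factor stores each value as an integer code, and base R looks those codes up in the current palette. The sketch below uses a short hypothetical vector as a stand-in for midwest$state so it runs without the data loaded.

```r
# Stand-in for midwest$state (hypothetical short vector with the same five state codes)
state <- c("WI", "IL", "OH", "IN", "MI", "IL", "WI")
state_factor <- as.factor(state)

# Factor levels are sorted alphabetically; each value is stored as an integer code
levels(state_factor)
as.integer(state_factor)

# Base R maps those integer codes to colors in the current palette,
# so legend(col = 1:5, legend = levels(...)) pairs color i with level i
color_key <- data.frame(
  state = levels(state_factor),
  color = palette()[seq_along(levels(state_factor))]
)
color_key
```

If you change the palette or reorder the factor levels, the legend built this way changes with it, because both the points and the legend use the same codes.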
A grid of plots in base R can be created using the parameter setting mfrow or mfcol in par(). Base R also gives you the option to make inset plots (subplots), like this example where the boxplot is inside the histogram.
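x <- rnorm(100, sd = 0.5) # continuous values for the histogram
y <- rbinom(100, 1, 0.5) # binary grouping variable for the boxplot
par(fig = c(0, 1, 0, 1)) # the main plot uses the full figure region
hist(x)
par(fig = c(0.07, 0.5, 0.5, 1), new = TRUE) # inset region; new = TRUE draws without clearing the histogram
boxplot(x ~ y)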
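The simple grid mentioned above can be sketched with par(mfrow = ...), which fills panels row by row (mfcol fills them column by column). This is a minimal sketch; the seed and the plot titles are arbitrary choices for illustration.

```r
set.seed(1)                      # arbitrary seed so the toy data is reproducible
x <- rnorm(100, sd = 0.5)

old_par <- par(mfrow = c(1, 2))  # 1 row, 2 columns, filled row by row
hist(x, main = "Histogram")
boxplot(x, main = "Boxplot")
grid_dims <- par("mfrow")        # record the grid that was in effect
par(old_par)                     # back to a single panel
```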
The layout() function allows the user to create multipanel plots of different sizes, like this:
# One figure in row 1 and two figures in row 2
# row 1 is half the height of row 2 (heights = c(1, 2))
# column 2 is one third the width of column 1 (widths = c(3, 1))
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE),
       widths=c(3,1), heights=c(1,2))
# with() looks up wt, mpg and disp in mtcars without attaching the data frame
with(mtcars, {
  hist(wt)
  hist(mpg)
  hist(disp)
})
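Before drawing real data, it can help to preview how layout() carved up the device. The sketch below reuses the layout matrix from the example above and calls layout.show() to outline and number each panel; the variable names are just for illustration.

```r
# Same layout matrix as above: figure 1 spans the top row,
# figures 2 and 3 share the bottom row
panel_matrix <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)
n_panels <- layout(panel_matrix, widths = c(3, 1), heights = c(1, 2))

# Draw the outline and number of each panel without plotting any data
layout.show(n_panels)

# Reset to a single panel afterwards
layout(1)
```

layout() returns the number of panels it defined, which is why its result can be passed straight to layout.show().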
Here is an example of a figure that Reni made using base R that layers a lot of information into a single visualization.
Multiple plot types in single pane
If you’re interested in other customizations in base R check out DREW’S BOOK GRAB TITLE
Below is an example from Selva Prabhakaran’s Tutorial. Prabhakaran has also compiled a list of 50 different visualizations along with the code here.
For more detailed examples, check out the R Graphics Cookbook by Winston Chang.
# install.packages("ggplot2")
# load package and data
options(scipen=999) # turn off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme
# midwest <- read.csv("http://goo.gl/G1K41K") # backup data source
# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=log10(poptotal))) +
  geom_point(aes(col=state)) + # color the points by state; the legend is added automatically
  # geom_smooth(method="loess", se=F) +
  xlim(c(0, 0.1)) +
  # ylim(c(0, 500000)) +
  labs(subtitle="Area Vs Population",
       y="Log10(Total population)", # y is log10(poptotal), so label it accordingly
       x="Area",
       title="Scatterplot",
       caption = "Source: midwest")
# , size=popdensity
plot(gg)
The plotly package is an add-on to ggplot2 for quick interactive plots. The package is still relatively new and under active development. Legends and other features are often poorly displayed, but the interactivity may be useful for data exploration during an in-person meeting.
Below is an example from the plotly website. You’ll notice the syntax is similar to ggplot2’s, but the functions are a bit different.
library(plotly)
p <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length,
             marker = list(size = 10,
                           color = 'rgba(255, 182, 193, .9)',
                           line = list(color = 'rgba(152, 0, 0, .8)', width = 2))) %>%
  layout(title = 'Styled Scatter',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))
p # plot the interactive graphic
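If you already have a ggplot2 figure, plotly can also convert it directly with ggplotly(), which is often the quickest route to an interactive version. This is a minimal sketch using the built-in iris data; the object names are illustrative, and it assumes both packages are installed.

```r
library(ggplot2)
library(plotly)

# An ordinary ggplot2 scatterplot of the built-in iris data
gg_iris <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# ggplotly() converts the ggplot2 object into an interactive plotly widget
p_interactive <- ggplotly(gg_iris)
p_interactive # hover over points to see their values
```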
## Animated plots
There are plenty of guides on how to create the “best” visualization.
Visualization Groups by Dr. Andrew Abela
If you’re plotting data to communicate (which is normally the case), some things you should keep in mind:
For more details, see Ten guidelines for effective data visualization in scientific publications by Kelleher and Wagener (2011).